ABSTRACT

Cohen, in a now classic paper on statistical power, reviewed articles in the 1960 issue of one psychology
journal and determined that the majority of studies had less than a 50-50 chance of detecting an effect that
truly exists in the population, and thus of obtaining statistically significant results. Such low statistical power,
Cohen concluded, was largely due to inadequate sample sizes. Subsequent reviews of research published in
other experimental psychology journals found similar results. We provide a statistical power analysis of
clinical neuropsychological research by reviewing a representative sample of 66 articles from the Journal of
Clinical and Experimental Neuropsychology, the Journal of the International Neuropsychology Society, and
Neuropsychology. The results show inadequate power, similar to that for experimental research, when
Cohen's criterion for effect size is used. However, the results are encouraging in also showing that the field of
clinical neuropsychology deals with larger effect sizes than are usually observed in experimental psychology
and that the reviewed clinical neuropsychology research does have adequate power to detect these larger
effect sizes. This review also reveals a prevailing failure to heed Cohen's recommendations that researchers
should routinely report a priori power analyses, effect sizes and confidence intervals, and conduct fewer
statistical tests.



J. Cohen's (1962) seminal power analysis of 
the Journal of Abnormal and Social Psy- 
chology seems to have had no noticeable 
effect on actual practice . . . Must we con- 
clude that researchers stubbornly neglect a 
major methodological issue over decades? 
(Sedlmeier & Gigerenzer, 1989). 

. . . the almost universal reliance on merely
refuting the null hypothesis as the standard
method for corroborating substantive theories
is a terrible mistake, is basically unsound,
poor scientific strategy, and one of the worst
things that ever happened in the history of
psychology. (Meehl, 1978, p. 817.)

Virtually all clinical neuropsychology research involves trying to
reject the null hypothesis to obtain statistical significance for the
effects of interest. Cohen (1962) clearly showed that failure to
consider the expected effect size when designing experiments can lead
to failure in rejecting the null hypothesis because of inadequate
statistical power to detect the effects. One would have expected
Cohen's demonstration to have led to changes in the literature;
however, more than twenty years later Sedlmeier and Gigerenzer (1989)
found that Cohen's warnings and suggested remedies, at least in one
journal, had produced little to no change. Furthermore, even if
experiments do have sufficient statistical power and succeed in
rejecting the null hypothesis and obtaining statistical significance
for an effect, Meehl (1978) concluded that they are still
scientifically unsound if all they do is reject the null
hypothesis. Meehl's view is shared in essence, if with less colorful
expression, by many others with respect to psychological research in
general (e.g., Cohen, 1990, 1994; Rosnow & Rosenthal, 1989) and to
neuropsychological and clinical psychological research in particular
(Barlow, 1986; Cicchetti, 1998; Soper et al., 1988). Is our research
then unsound and a terrible mistake? The answer depends crucially on
the word merely in the quote from Meehl above. Critics of blind
adherence to null hypothesis testing routinely argue that, in addition
to reporting the significant and nonsignificant comparisons between
groups, researchers should also report and discuss the size of their
effects. Thus, to avoid the criticisms quoted above, clinical
neuropsychology researchers must determine what effect size they are
expecting before conducting an experiment, ensure that their
experiments have sufficient statistical power to detect such effects,
and subsequently report and discuss the effect sizes they actually
observed. The present study was conducted to determine how well current
published research in clinical neuropsychology is meeting these
requirements.

In 1962 Jacob Cohen reviewed 78 articles in the Journal of Abnormal and
Social Psychology (later Journal of Abnormal Psychology) and calculated
the average statistical power. Statistical power is defined as the
probability of detecting an effect of a certain size that is truly
present (i.e., the probability of obtaining statistical significance
and rejecting the null hypothesis). Cohen discovered, assuming an alpha
level of .05 and the presence of a medium effect size, that the median
statistical power of the reviewed studies was a weak .48. Cohen
criticized this level of power as inadequate, and concluded that the
low power in these studies was primarily due to inadequate sample
sizes. Cohen tried to correct this state of affairs, most notably by
writing a book to help researchers avoid conducting studies that fail
to detect meaningful differences, or that are not likely to be
replicable, by demonstrating how to measure effect sizes, conduct
statistical power analyses, and choose appropriate sample sizes (Cohen,
1969, 1988). A number of studies have extended and simplified the
utilization of statistical power analyses in research since the
publication of Cohen's original paper (Sawyer & Ball, 1981; Goldstein,
1989). Nevertheless, the efforts of Cohen and others apparently had
little effect on psychological publications, as Sedlmeier and
Gigerenzer (1989) reported that, not only was the median power of the
articles in the 1984 volume of the Journal of Abnormal Psychology
essentially unchanged from the 1960 volume, but that reviews of
psychology journals in other fields had also found similar inadequate
levels of statistical power.

The statistical power of a study depends on three variables: the level
at which the alpha significance is set, the sample size, and the effect
size that must be detected. How large an effect size should one assume
must be detected? A sample size that would provide poor statistical
power to detect a small effect size (little difference between the
groups) could provide good power to detect a larger effect size. Effect
size is frequently indexed, following Cohen, as δ = (μ1 − μ2)/SD for
the standardized difference between two group means, or as r for
correlations. One of Cohen's important contributions was to propose
practically meaningful criteria for these mathematical quantities. For
example, a 'medium' effect size was defined (as δ = .50, or r = .30) to
'represent an effect likely to be visible to the naked eye of a careful
observer' (Cohen, 1992, p. 156). Cohen (1962) based his statistical
power analysis review on the assumption that experimental psychological
research was likely to be concerned with detecting 'medium' sized
effects. Cohen's estimate of a 'medium' effect size appears to have
been reasonable, as Sedlmeier and Gigerenzer's (1989) subsequent
re-analysis of the 1960 data found that the mean effect size reported
in those studies was, indeed, very close to Cohen's 'medium' size.
Thus, a medium effect size seems to be an appropriate standard to use
in designing and evaluating the statistical power of experimental
psychological studies. But is it an appropriate standard for studies in
clinical neuropsychology? If the effects in clinical neuropsychology
are 'medium' in size, then research with small samples in this field
will have unacceptably low power. If, however, the effect sizes are
larger, then small samples may provide adequate statistical power.
Cohen's medium effect size is likely to be too small for applied
clinical research, since this effect size would only allow a 60%
correct diagnostic classification rate, which is not substantially
better than the 50% chance rate.

Cohen's widely quoted investigation of statistical power in social and
abnormal psychology gives us reason to be concerned about the
statistical power of clinical neuropsychological research. The median
sample size in the studies reviewed by Cohen was 68 and yet this was
too small to yield adequate power. In fact, a total sample size of 102
is required to achieve Cohen's recommended power of .8 for detecting a
medium effect with a two-group p < .05 one-tailed t-test (Faul &
Erdfelder, 1992). Since practical constraints such as the availability
of subjects often place limitations on sample sizes in clinical
research, one might be concerned that the statistical power of clinical
neuropsychological research would be even lower than the power
discovered in previous reviews of experimental research. Although
reviews of statistical power in published studies have been carried out
in other domains of psychology, we know of no such study within the
field of neuropsychology. Thus, we propose to follow the lead of
previous authors who have investigated the extent to which published
research was conducted with sufficient statistical power, and, to
paraphrase Rosenthal (1990), ask 'How are we doing in clinical
neuropsychology?'
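
As an illustration only, the sample size requirement quoted above can be roughly reproduced with a normal approximation. This is not the GPOWER computation cited in the text; the function name `n_per_group` and its parameters are hypothetical, and the approximation slightly underestimates the exact t-test requirement, so it yields a total near 100 rather than exactly 102.

```python
import math
from statistics import NormalDist


def n_per_group(delta: float, alpha: float = 0.05, power: float = 0.80,
                tails: int = 1) -> int:
    """Approximate per-group n for a two-group comparison of means.

    Uses the normal approximation n = 2 * ((z_alpha + z_power) / delta) ** 2,
    which slightly underestimates the exact t-test requirement.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / tails)  # critical value for the test
    z_power = z.inv_cdf(power)              # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / delta) ** 2)


# Medium effect (delta = .5), one-tailed alpha = .05, power = .80.
medium = n_per_group(0.5)
print(medium, 2 * medium)
```

Under these assumptions the per-group estimate is about 50, close to the 51 per group implied by the exact total of 102.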

The primary purpose of the present study was to conduct a descriptive
statistical review of a sample of clinical neuropsychology research and
to provide data on: (a) the statistical power of published research
studies to detect the 'medium' effect size that has been assumed in
previous reviews of experimental research; (b) the statistical power of
these studies if larger effect sizes are assumed (e.g., large enough to
allow 75% correct classification); (c) the actual estimated population
effect sizes derived from these studies; (d) the statistical power of
these studies to detect the actual effect sizes that they were
reporting; and (e) the actual sample sizes. These results will allow an
evaluation, similar to Cohen's (1962) original review, of the adequacy
of statistical power in current clinical neuropsychological research. A
second purpose was to provide data on the percentage of published
studies that actually report power analyses and effect sizes (versus
those that appear to be 'merely refuting the null hypothesis'), and on
the number of statistical tests conducted.



METHOD 

The approach of this study was to review a representative sample of
published articles in the field of clinical neuropsychology that
reported comparisons of a clinical group to either another clinical
group or to a normal comparison group. Articles from the Journal of
Clinical and Experimental Neuropsychology, the Journal of the
International Neuropsychology Society, and Neuropsychology were
reviewed to determine: the observed statistical power for three
different effect sizes, the actual effect size, the statistical power
given the actual effect size, and the sample size. The number of
statistical tests completed, the use of any alpha corrections, and
whether the article mentioned power and effect sizes were also
tabulated.

Calculations of statistical power were made on what was deemed to be
the primary hypothesis of the study. Post hoc power analyses were
conducted using the GPOWER program (Faul & Erdfelder, 1992) with the
.05 value of alpha assumed. In some studies, alpha correction
procedures were used (e.g., Bonferroni) and these were noted, but were
not incorporated into our calculations. Power was calculated in
relation to three effect sizes: δ = .5 (Cohen's 'medium'), .8 (Cohen's
'large'), and 1.35. This latter value would allow a correct diagnostic
classification percentage of 75%, assuming normal distributions, equal
sample sizes, and sensitivity equal to specificity (Cohen, 1988, pp.
21-23). We chose the 75% value as the lowest classification accuracy
that we thought would be clinically useful. (Higher accuracy would
require still larger effect sizes and would result in higher estimated
statistical power.) In contrast, a medium effect size of δ = .5
corresponds to a correct classification of only 60%, which in our view
would not be of substantially greater clinical utility than the 50%
chance level.
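
The link between effect size and classification accuracy stated above can be sketched as follows. The sketch assumes two normal populations with equal standard deviations, equal base rates, and a cutoff placed midway between the two means, in which case the expected proportion correctly classified is Φ(δ/2). The function name `classification_rate` is hypothetical and is used here only for illustration.

```python
from statistics import NormalDist


def classification_rate(delta: float) -> float:
    """Expected proportion correctly classified with a midpoint cutoff.

    Assumes two normal populations with equal SDs and equal base rates,
    so sensitivity equals specificity and both equal Phi(delta / 2).
    """
    return NormalDist().cdf(delta / 2)


# The three effect sizes considered in this review.
for delta in (0.5, 0.8, 1.35):
    print(f"delta = {delta:4.2f}: {classification_rate(delta):.1%} correct")
```

Under these assumptions, δ = .5 gives roughly 60% correct classification and δ = 1.35 gives roughly 75%, in line with the values quoted in the text.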

Thirty-three articles were reviewed from each of the 1998 and the 1999
volumes for a total of sixty-six articles. The first article in the
first issue for each journal was the starting point and articles were
reviewed in sequence until 11 articles were completed for each
particular journal and year. No articles were excluded unless by
necessity, namely because of the lack of any statistical analyses in
non-experimental studies, the use of only descriptive statistics, the
use of non-clinical groups, a predicted non-significant result, our
inability to determine which single statistical test corresponded to
the major hypothesis (or to determine what the major hypothesis might
be), or a study that did not ultimately relate a two-group comparison
to the primary hypothesis. A few articles used ANCOVA when the
covariate was significantly different between the pre-existing groups,
an inappropriate application (Evans & Anastasio, 1968), and these
studies were also excluded. An estimated population effect size (using
the delta parameter) was calculated using the formulae given by
Richardson (1996) from data provided in the original articles. Power
for the effect size implied by the data was then recalculated using
GPOWER.
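
For readers without access to GPOWER, a post hoc power calculation of the kind described in this section can be approximated as sketched below. This is a normal approximation, not the noncentral t computation that GPOWER performs, so it runs slightly high for small samples. The function name `approx_power`, the two-tailed default, and the example group sizes are assumptions made for illustration and are not taken from the reviewed studies.

```python
import math
from statistics import NormalDist


def approx_power(delta: float, n1: int, n2: int, alpha: float = 0.05,
                 tails: int = 2) -> float:
    """Normal-approximation power for a two-group comparison of means."""
    z = NormalDist()
    # Expected value of the test statistic under the alternative hypothesis.
    noncentrality = delta * math.sqrt(n1 * n2 / (n1 + n2))
    z_crit = z.inv_cdf(1 - alpha / tails)
    power = 1 - z.cdf(z_crit - noncentrality)
    if tails == 2:
        # Small chance of rejecting in the direction opposite to the effect.
        power += z.cdf(-z_crit - noncentrality)
    return power


# Hypothetical study with 20 participants in each of two groups.
for delta in (0.5, 0.8, 1.35):
    print(f"delta = {delta:4.2f}: power = {approx_power(delta, 20, 20):.2f}")
```

The same function can be applied to an effect size estimated from a study's own data, which corresponds to the final recalculation step described above.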



RESULTS 

Table 1 shows the results of the power analyses. Medians are reported
instead of means since some studies had relatively large samples (e.g.,
N = 187) and hence large power. Statistics for the major hypotheses are
presented in Table 2. Table 3 shows statistics for the total number of
statistical tests performed and for the total number of participants in
each of the reviewed studies. Table 4 shows the percentage of the
reviewed studies using or reporting alpha level corrections, power
analyses, or effect sizes.



DISCUSSION 

The median statistical power to detect a medium effect size (δ = .50)
for the 66 clinical neuropsychology research studies examined was .50,
which is remarkably similar to the previous results for experimental
studies. Cohen's 1962 review reported a mean power of .48 for an
assumed medium effect size, and Sedlmeier and Gigerenzer's 1989 review
reported a mean power of .50 for an assumed medium effect size. Thus,
using similar criteria to those that Cohen and others have used to
evaluate experimental psychology literature, clinical neuropsychology
research appears to
have equivalent power. This level of power is, however, inadequate if
the effect sizes really are medium sized.


Table 1. Power Analysis Results for 66 Clinical Neuropsychology Studies
Published During 1998 and 1999 in Three Major Journals.

Assumed effect size
(Cohen's d)             Median    Min      Max      M        (SD)

0.50                    .451      .170     .944     .500     (.201)
0.80                    .785      .310     .999     .768     (.189)
1.35                    .990      .610     .999     .957     (.080)


Table 2. Statistics for the Major Hypotheses in the Reviewed Studies.

Statistic                             Median   Min      Max      M        (SD)

Population effect size (Cohen's d)    0.911    0.020    5.312    1.151    (0.837)
Population effect size (100 x r²)     16.9     0.0      87.6     22.8     (17.7)
Power for population effect size      0.93     0.0      0.999    0.849    (0.199)


Table 3. Statistics for the Total Number of Statistical Tests Performed
and for the Total Number of Participants in each of the Reviewed
Studies.

Statistic                 Median   Min      Max      M        (SD)

Number of tests           24       3        139      29.5     (21.91)
Number of participants    39.5     10       187      53.68    (38.49)
Participants per test     1.45     0.26     20.00    2.84     (3.36)


Table 4. Percentage of the Reviewed Studies Using or Reporting Alpha
Level Correction, Power Analysis, or Effect Size.

Procedure                  Frequency

Alpha level corrections    18.2%
Power analysis             3.0%
Effect size                9.1%

Note. Effect size could be reported by use of any accepted method, for
example Cohen's d, percentage of explained variance (r²), r, or
statistics on classification rates.

Sedlmeier and Gigerenzer (1989) also calculated actual observed sample
effect sizes for a random selection (N = 20 papers) in the collection
of articles reviewed in their own paper and by Cohen in 1962. They
reported that, for both, the observed mean effect size was medium. In
sharp contrast, the clinical neuropsychological studies that we
reviewed here reported results yielding a mean estimated population
effect size of δ = .88, which is a 'large' effect size according to
Cohen's (1988) criteria. As seen in Table 1, the median statistical
power of these studies to detect a large effect size (δ = .80) was .79,
which happens to be almost exactly the power value recommended by
Cohen. The median statistical power to detect an effect of the size
actually found by each study was .93. Furthermore, if one were to
assume that for clinical utility the means of two populations (e.g.,
normal versus pathological) would need to be separated by at least 1.35
standard deviations on the measured variable (population effect size
δ = 1.35) to allow better than 75% non-overlap of the distributions
(Cohen, 1988, p. 22), for 75% expected correct diagnostic
classification of a new case, then these studies had a median power of
.99 to detect effect sizes that are clinically meaningful for the
individual diagnosis.

Thus, our review suggests that clinical neuropsychology research
typically deals with much larger effect sizes than does experimental
research. Furthermore, the sample sizes and statistical power of
published clinical neuropsychological research appear to be adequate to
detect the magnitude of effect sizes that this research actually
confronts. Clinical neuropsychology research, therefore, appears to be
in a better position to detect significant differences than the
experimental research reviewed by Cohen (1962) and Sedlmeier and
Gigerenzer (1989). Neuropsychological research with relatively small
samples will often be a reasonable undertaking, while such an
investment of time and energy would typically be an unwise expenditure
in experimental research where smaller effects must be reliably
detected. In response to the question 'How are we doing in clinical
neuropsychology?', Rosenthal's (1990) answer to his similar question
seems appropriate, namely 'better than we might have thought', at least
with respect to whether the sample sizes are large enough to provide
acceptable statistical power to detect the effect sizes typically
involved.

Lest we become complacent, however, there are still reasons to be
concerned about power and effect size. Few studies appear to conduct a
priori power analyses; only 3% of the reviewed studies reported such an
analysis. Before gathering data, we recommend that researchers
estimate, or at least assume, an effect size they need to detect and
then conduct an a priori power analysis to determine the sample size
needed for acceptable statistical power (e.g., .8). The effect sizes
found in the reviewed studies suggest that it would be appropriate to
assume Cohen's 'large' effect size (δ = .8) for theoretically motivated
clinical neuropsychological research. For applied clinical
neuropsychological research, such as to determine whether a measure
might be appropriate for reliable diagnostic classification, very large
effect sizes might be appropriate (e.g., δ > 1.35). Furthermore,
research articles should provide information about the obtained effect
sizes; only 9% of the reviewed clinical neuropsychological studies
explicitly reported the effect size of their results. (Sufficient
information was, of course, available in the articles to allow us to
calculate effect sizes.) This could be done by calculating and
reporting δ, or a similar measure, so that other researchers could use
this value as an empirically based estimate of effect size for a priori
power analyses as they design new studies with similar samples and
measures. Effect size can also be reported in other ways to serve other
purposes. Reporting confidence intervals, especially graphically, for
measures in each group can be helpful in the communication of results
and can facilitate theoretical interpretation. Confidence intervals,
while common in other areas of psychology and in the physical sciences,
were not reported for any of the reviewed studies. Another effect size
index, and one that might be especially relevant and understandable for
clinical neuropsychologists, is the inferred population non-overlap
percentage (e.g., 75% when δ = 1.35; Cohen, 1988), which would provide
information about whether the effect is large enough to be of
diagnostic value. The low incidence of reporting of a priori
statistical power analyses and obtained effect sizes indicates that the
criticisms of Cohen, Meehl and others about research that 'merely'
focuses on statistical significance continue to be relevant to clinical
neuropsychology research. Our recommendations to conduct a priori
statistical power analyses and report effect sizes are not new (cf.
Cohen, 1962). Power analyses can readily be carried out using Cohen's
tables (Cohen, 1988) or with a variety of currently available software
such as GPOWER (Erdfelder, Faul, & Buchner, 1996; Faul & Erdfelder,
1992), SAMPLE POWER (SPSS, 1999), or other products. Following these
recommendations will result in research that is both more sound and
more useful.
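
A minimal sketch of the kind of reporting recommended above: a standardized mean difference computed from group summary statistics, together with an approximate confidence interval. The function name `cohens_d_with_ci` and the example numbers are hypothetical. The interval uses a common large-sample standard error for d rather than the formulae of Richardson (1996) used in this review, so it should be read as an illustration, not as the method applied here.

```python
import math
from statistics import NormalDist


def cohens_d_with_ci(m1, s1, n1, m2, s2, n2, level=0.95):
    """Cohen's d from group summaries, with an approximate CI."""
    # Pooled within-group variance.
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    d = (m1 - m2) / math.sqrt(pooled_var)
    # Large-sample standard error of d.
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d, (d - z * se, d + z * se)


# Hypothetical example: two groups of 20 with equal SDs.
d, (lower, upper) = cohens_d_with_ci(10.0, 2.0, 20, 8.0, 2.0, 20)
print(f"d = {d:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

With small groups such as these the interval is wide, which is itself useful information for anyone planning a follow-up study around the reported value.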

The number of statistical tests conducted and reported is also an area
of concern. We were surprised at the high number of statistical tests
that were reported in the reviewed studies (median 21, maximum 139);
82% of the time with no control over the experimentwise Type I error
rate. The problem with multiple comparisons is that if the alpha level
(e.g., for p < .05) is kept constant for each test regardless of how
many tests are performed, then the likelihood of making a Type I error
(falsely obtaining statistical significance because of random variation
in the data) on at least one of the comparisons becomes much larger.
With 139 comparisons, one would expect 7 to be significant at the
p < .05 level by chance alone. The use of a Bonferroni adjustment to
the alpha level would control for this problem, but at the cost of
greatly reduced power, which makes obtaining statistical significance
for any real effect much less likely (e.g., by using p < .0005 to
adjust for 139 comparisons). We, like Cohen, did not adjust alpha
levels to control for multiple comparisons. If we had, the statistical
power results would have been substantially lower, and the conclusion
would have been that sample sizes were too small. While Bonferroni
adjustment of alpha level is not an ideal solution for the multiple
comparison problem (cf. Rothman, 1990), conducting a high number of
statistical tests with no clear a priori hypotheses is not appropriate
either. Our view is that research using the null hypothesis statistical
testing model should be reported with explicit statements describing
the major hypotheses of the study and with the statistical test for
each hypothesis (e.g., a particular t-test, ANOVA main effect, or
interaction) clearly stated or implied. If a report identifies a few
statistical tests as crucial, then the presence of a few dozen
additional statistical tests (e.g., of unpredicted interactions or of
control or nuisance variables) would not indicate a problem with
multiple comparison Type I error rate inflation in the testing of the
major hypotheses. Unfortunately, in conducting this review we often
found it difficult, and sometimes impossible, to determine what the
authors' main hypotheses were, and/or to determine which of the many
relevant comparisons they considered the most important test of a
hypothesis. For example, it might not be clear whether some of 21
effects in an ANOVA were non-predicted or if the intent of the authors
was to test them all on an equal basis. Cohen addressed this problem
with his 'less is more' recommendation (Cohen, 1990). In our view,
articles should clearly identify the crucial few tests of the major
hypotheses (and not use an alpha adjustment for these) and distinguish
these from what are essentially post hoc tests of non-predicted effects
(for which alpha adjusted tests should be used). In addition, greater
use of multivariate test methods, to allow a single test instead of
many univariate tests, would also help to address the multiple
comparison problem.
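
The arithmetic behind the multiple comparison problem described above can be sketched as follows. The calculation assumes that all tests are independent and that every null hypothesis is true, which is a simplification; the function name `multiple_test_summary` is hypothetical.

```python
def multiple_test_summary(n_tests: int, alpha: float = 0.05) -> dict:
    """Type I error bookkeeping assuming independent tests of true nulls."""
    return {
        # Average number of tests significant by chance alone.
        "expected_false_positives": n_tests * alpha,
        # Probability of at least one Type I error across all tests.
        "familywise_error_rate": 1 - (1 - alpha) ** n_tests,
        # Per-test alpha after a Bonferroni adjustment.
        "bonferroni_alpha": alpha / n_tests,
    }


# The largest number of tests found in the reviewed studies.
for name, value in multiple_test_summary(139).items():
    print(f"{name}: {value:.5f}")
```

For 139 tests at alpha = .05 this gives about 7 expected chance findings, a near-certain probability of at least one Type I error, and a Bonferroni-adjusted per-test threshold well below .001.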

There are certain limitations to this study. As with meta-analyses, one
unavoidable limitation of a statistical power review is the
'file-drawer' problem, which supposes that many non-significant results
are never reported (Rosenthal, 1984). This could artificially inflate
estimates of effect size. It could also be argued that this paper
neglects important effect size differences within neuropsychology. It
is true that different populations will tend to have different effect
sizes. Hence, if we had, for example, compared the studies with
traumatic brain injury populations to the epilepsy surgery papers we
may have uncovered different effect sizes that were averaged out by
examining neuropsychological studies as a whole. However, the aim of
this paper was to enable comparisons of statistical power and effect
size in clinical neuropsychology with the previous statistical power
reviews that have been conducted in other fields of psychology (Cohen,
1962; Sedlmeier & Gigerenzer, 1989).

Hypothesis testing is directed toward YES/NO decision making. This is
an appropriate format for certain kinds of research (e.g., drug studies
or in agronomy - the discipline for which hypothesis testing was
created) and not for others. Since neuropsychological research is not
solely directed to making decisions, less exclusive reliance on
hypothesis testing would be reasonable. Neuropsychological researchers
might consider asking themselves the question 'Am I more concerned with
"What is the case?" or with "What is wise to do?"' (Bakan, 1966, p.
435). If the answer is the former, then approaches other than
hypothesis testing might be investigated. More generally, we concur
with recommendations for greater reliance on descriptive statistics,
graphical methods, and the routine reporting of confidence intervals
and effect sizes (Cicchetti, 1998; Cohen, 1994).